Random effects compound Poisson model to represent data 

with extra zeros 



Marie-Pierre Etienne*' a ' b , Eric Parent a,b , Hugues Benoit , Jacques Bernier 3 

a AgroParisTech, UMR 518, F-75000 Paris, France 
b INRA, UMR 518, F-75000 Paris, France 
c Fisheries and Oceans Canada, Moncton, New Brunswick, Canada 



Abstract 

This paper describes a compound Poisson-based random effects structure for mod- 
eling zero-inflated data. Data with large proportion of zeros are found in many 
fields of applied statistics, for example in ecology when trying to model and pre- 
dict species counts (discrete data) or abundance distributions (continuous data). 
Standard methods for modeling such data include mixture and two-part conditional 
models. Conversely to these methods, the stochastic models proposed here behave 
coherently with regards to a change of scale, since they mimic the harvesting of a 
marked Poisson process in the modeling steps. Random effects are used to account 
for inhomogeneity. In this paper, model design and inference both rely on conditional 
thinking to understand the links between various layers of quantities : parameters, 
latent variables including random effects and zero-inflated observations. The poten- 
tial of these parsimonious hierarchical models for zero-inflated data is exemplified 
using two marine macroinvertebrate abundance datasets from a large scale scientific 
bottom-trawl survey. The EM algorithm with a Monte Carlo step based on impor- 
tance sampling is checked for this model structure on a simulated dataset : it proves 
to work well for parameter estimation but parameter values matter when re-assessing 
the actual coverage level of the confidence regions far from the asymptotic conditions. 
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i 1. Introduction 



2 Often data contain a greater number of zero observations than would be predicted 



3 using standard, unimodal statistical distributions. This currently happens in ecol- 

4 ogy (see (l6j ]) when counting species (over-dispersion for discrete data) or recording 

5 biomasses (atoms at zero for continuous data). Such data are generally referred to 



e as zero-inflated data and require specialized treatments for statistical analysis [12 . 
7 Common statistical approaches to modeling zero-inflated data make recourse either 
s to mixture models, such as the Dirac function for the occurrence of extra zeros in 



9 addition to a standard probability distribution (see for instance [21] ), or to two-part 

10 conditional models (a presence/absence Bernoulli component and some other distri- 

11 bution for non zero observations given presence such as in 25[). These models are 



12 well-known |4| and offer the advantages of separate fits and separate interpretations 

13 of each of their components. Parameters are well understood and interpreted as the 

14 probability of presence, and the average abundance of biomass if present. 

15 However, a major flaw of those models is their non-additive behavior with regards 



i6 to variation in within-experiment sampling effort [26(. Consider for instance the 

n fishing effort measured by the ground surface swept by a bottom-trawl during a 

is scientific survey of benthic marine fauna. If during experiment i, observation Y{ is 

19 made with some experimental effort corresponding to the harvesting of some area 

20 Di and is assumed to stem from a stochastic model with parameters 6(Di), then the 

21 additivity properties of coherence are naturally required: if we consider two (possibly 

22 subsequent) independent experiments i and i' on the different non overlapping areas 

23 Di and Dp, we would expect that the random quantity Yj + Yy stems from the same 

24 stochastic model with parameters QiDi U Dy). A compound Poisson distribution is 

25 a sum of independent identically distributed random variables in which the number 

26 of terms in the sum has a Poisson distribution. Compound Poisson distributions are 

27 candidate models purposely tailored to verify the previous desired infinite divisibility 

28 property since the class of infinitely divisible distributions coincides with the class of 

29 limit distributions of compound Poisson distributions theorem 3 of chapter 27). 

30 Depending on the nature of the term in the random sum, the compound distribu- 

31 tion can be discrete or continuous. The construction of such a compound distribution 

32 with an exponential random mark for continuous data and with a geometric one for 

33 counts is recalled in section 2. This approach is worthwhile for two reasons. The 

34 first is parsimony : there is only one parameter for the Poisson distribution plus an 

35 additional one for the probability distribution function - pdf - of each component 

36 of the random sum. Secondly, the compound construction may assist our under- 

37 standing in cases where the data collection can be interpreted in terms of sampling a 

38 latent marked Poisson field. That is to say that the data appear in latent "clumps" 
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39 that are "harvested" during the experiment, the Poisson parameter being the pres- 

40 ence intensity of such clumps. A random variable is used to mimic the quantity (or 
the number of individuals in the discrete case) independently in each clump. At 
the upper level of the hierarchy, random effects are added to depict heterogeneous 
conditions between blocks of experiments. 

In section 3, we develop a stochastic version of the EM algorithm [3] with a 

45 Monte-Carlo step (using importance sampling) for this non Gaussian random effect 

46 model with zero-inflated data. Maximum likelihood estimates and the correspond- 

47 ing variance-covariance matrix are derived. The computational task remains rather 

48 tractable thanks to simplifying gamma-exponential conjugate properties in the con- 

49 tinuous case (and beta-geometric conjugacy in the discrete case). 



In section 4, the hierarchical model 29(] with compound Poisson distribution for 



51 zero-inflated data is exemplified using a real case study with two marine species, 

52 urchin and starfish abundance data from a scientific bottom-trawl survey of the 

53 southern Gulf of St. Lawrence, Canada. The EM algorithm performs well in obtain- 

54 ing the maximum likelihood estimates of parameters, but for one of the two species 

55 we notice some discrepancy between the actual coverage of the confidence intervals 

56 and their theoretical levels (as given by the asymptotic normal approximation). Con- 

57 sequently, we further focus on variance covariance matrix estimation in section 5 and 

58 investigate via simulation the behavior of coverage level of confidence intervals for 

59 various experimental designs, in search of a practical fulfillment of the asymptotic 
eo conditions. Finally, we briefly discuss some inferential and practical issues encoun- 
6i tered when implementing such hierarchical models for zero-inflated data. 



62 2. Model construction 

63 We propose a hierarchical construction to represent data with extra zero collected 

64 over a non-homogeneous area. The model is divided into two main layers : in the 

65 first one, we model the sampling process within a homogeneous sub-area (strata) 
ee and in the second layer, we introduce heterogeneity between strata at the top of 
67 the hierarchy using random effects. The first subsections detail the hierarchical 
es constructions for continuous data. In the last subsection 12.41 we sketch out an 

69 obvious modification to represent count data. 

70 2.1. Compound Poisson process to introduce extra zeros 

71 Imagine that data Y are obtained by harvesting an area D and that there are 

72 some clumps distributed according to an homogeneous Poisson process : clumps are 

73 uniformly distributed with a constant intensity, say fi. 
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By harvesting an area D, we pick an integer- valued random variable N of clumps. 
According to Poisson process property N follows a Poisson distribution of parameter 
fi D. For each clump i the independent random variables Xi or marks (with the same 
probability distribution) represent for instance the possible biomass in each clump 
to be collected. 

The final return will consist of the sum over N clumps of the amount contained in 
each clump. With the convention that Y = if N = , the random sum : 

N 
i=0 

74 is said to follow a compound Poisson distribution. Figure [1] exemplifies a realization 

75 of the total amount of a collect (i.e., sum of the marks) in a sampled region D. 

76 

77 The Poisson-based additivity property avoids the drawback of classical models 

78 mentioned in the introduction. Generally, D is the area of the sampled area included 

79 in IR 2 . We assume an homogeneous region fi(D) = fiD, so that the expected number 
so of collected clumps is proportional to the catching effort. The difficulty with the 
si generalization to an inhomogeneous Poisson process lies in the inference step, not in 

82 the modeling step. Consequently we used another approach to deal with heterogene- 

83 ity (see section [273]) . In the following, we mostly omit to index quantities with this 

84 catching effort for presentation clarity, explicitly mentioning it only when necessary. 

85 Summary statistics about such compound distribution Y are easily obtained (the 
se characteristic function is given in appendix [A} : 



E(Y) =fiDE(X) 
Yar(Y) =fiDE(X 2 ) 

Parameter /x rules the occurrence of zero values when assuming ¥(X = 0) = 
i.e. that the random mark is non atomic at : 

P(y = 0) = exp (-//£>). 

2.2. Choice of the random component X for continuous data 

For real-valued data with extra zeros, we will concentrate in this paper on the 
exponential distribution of parameter p for component X such that E(X) = p _1 , 
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total collected biomass - 7.7 



sampled surface 



collected clumps 




Figure 1: Realization of a marked Poisson process on a region of R 2 , the sample is conducted over 
a region D. Here the total catch is y = 7.7, the effective number of collected clumps is 8. 
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leading to 



E(Y) 



and ¥ar(Y) = 2 



P P 
To keep on with an ecological interpretation of the model, assuming that the mark 
X follows an exponential distribution of parameter p, means for the biologist that 
the probability of finding a large amount of biomass within a clump is exponentially 
decreasing and that the average quantity in each clump is p~ l . When no clump is 
collected, there occurs a zero for the model Y. We choose the exponential distribution 
because of parsimony and because of its interesting conjugate property detailed in 
section 13.1.21 

This compound Poisson distribution was termed law of leaks (LOL) by 0], where 
X represents elementary unobserved leaks occurring at N holes (uniformly located) 
along a gas pipeline. In summary : 



{Y~LOL(ji,p)) 



\ (X U ...,X N ) l ~ d S(p) J 



[2.2) 



97 For the discrete case, a similar definition holds with the corresponding geometric 

98 distribution for the marks (see section 12711) . 
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2.3. Random effects 

Although the previous compound construction could have formally been extended 
to non-homogeneous Poisson processes, it is easier but still quite realistic to relax the 
assumption of homogeneity by considering homogeneous blocks (or strata), modeling 
possible inter-block dispersion using random effects. We consider S blocks ; in a given 
block s there are I s grouped observations. We denote by Y = (Y s x, . . . , Y s j s ) the 
random vector in block s and by Y = (Yi, . . . , Y$) the whole vector over the S blocks. 
The coefficients a and b of the gamma pdf T(a, b) for a random variable /i are such that 
E(/-0 = f an d Yar(p) = A. The random effect model RLOL(a,b,c,d) representing 
the occurrence of the sample Y is defined by the following set of equations. 




,Ps) r(a,6), 

ps) J ~ d r(c, d), 



Y ~ RLOL(a, b, c, d) 

Y s j s | ii s , p s ~ LOL (fx s D s>k , p s ) Vs G {1, . . . , S} 

(2.3) 

100 The choice of a gamma distribution for the random effect is motivated by conjugate 

101 properties which are useful in the inference of the model. Section 14.1.31 will show 

102 that it may also be quite a realistic distribution for some datasets. The hierarchical 

103 construction is summed up by the directed acyclic graph (DAG as termed by [231 ]) 

104 in Figure [2j 
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Figure 2: DAG of the RLOL model 



105 2.4- Compound Poisson process for count data 

we A similar but discrete version to model count data, can be obtained by changing 

107 the nature of the random marks of the Poisson process. In this paper, we study a 

los geometric distribution with parameter p = ¥(X = 1). The core of the model is thus 

109 given by the following compound Poisson process with geometric marks : 
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(Y ~ DLOL(n,p)) 



\ (x u ... : x N ) l ~g( P ) J 

To preserve conjugate properties, the gamma distribution for the random effect 
on the marks is replaced by a beta distribution so that the count data version of the 
model is given by : 



Y ~ RDLOL(a, b, c, d) 



Oi, . . . ,/i 5 ) l ~ d r(a,6), 
(pi,...,Ps) ~ P(c,d), 



I Vs,p s ~ DLOL(fi s D Stk ,p s ) Vs e {1, . . . 

(2.4) 

no where DLOL means Discrete version of Law of leaks and RDLOL discrete law of 
in leaks with random effects. 

n2 In most of the paper, we will simply state the main results when technical aspects 
n3 of the proofs are shared between discrete and continuous cases. 



S} 



H4 3. Estimation via the EM algorithm with importance sampling 

H5 Hierarchical models such as 12.31 or 12.41 cannot be straightforwardly estimated 

lie because of the latent variables. The random effects (/i, p) and the unknown numbers 

in of clumps N must be integrated out to obtain the likelihood. The likelihood has no 

us closed form and estimators cannot be directly derived. In such classical 

n9 strategy is to use Expectation Maximization algorithm (0) to derive max-likelihood 

120 estimates. In our case the E step is not analytically accessible. An alternative is to 

121 use a stochastic version of this EM algorithm such as Monte-Carlo EM ( MCEM see 
or or stochastic approximation of EM (SAEM see 



122 



123 We detail in this section how to implement a MCEM algorithm using Importance 

124 sampling to obtain the maximum likelihood estimation and its empirical variance 

125 matrix. Similar results concerning count data process are summed up in the last 

126 subsection. From this point onwards we will use brackets to denote pdf's as many 

127 conditioning terms will appear in the probabilistic expressions derived from the model 

128 fully specified by the set of equations (12.31 ). The brackets denote either a density or a 



129 discrete probability distribution, as in [lfj. Following Bayesian conventions, we will 



no also allow the parameters to appear as conditioning terms (i.e., instead of writing 

131 F(X) we will specify [X\a, b, c, d]) so as to help the reader understand which layer of 

132 the hierarchical model (12. 3p the probability expression refers to (see Fig [2]). 
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133 3.1. Implementation of the MCEM algorithm 

134 In this paper, 9 stands for the set of parameters (a, b, c, d) in the RLOL model. 

135 Given the random effects, the data within a block are independent : 

s 

8=1 

136 where L s denotes the complete log-likelihood in block s, i.e. : 



L s = L^Ys, Ns,n s , p s ) = ( ln([y S|i |A^,p s ] [N.M) J + (3.1) 

;=iis 

\n([fj, s \a,b}) + ln([p s |c, d}) 



137 Following 28j], the pivotal quantity in the EM algorithm (recalled in appendix 



138 [Dj is the conditional expectation of the complete log-likelihood 

Q(8,8')=E e , (£(0;Y,N,M|Y) 

139 3.1.1. Maximization step 

wo To maximize Q(9, 9') with respect to 9, we focus on the terms that involve 9 : 

s s 
Q(9,e') = C_ e (F) + (a-l) x Y, E e> (ln/* fl \ Y) + Salnb - bj^^e' I ^} - 5 ln(r(a)) 

S=l 8=1 

5 5 
+ (c-l) x^E 9 , (lnp s \Ys) +Sclnd-dJ2^e' {ps \ Y) - 51n(r(c)), (3.2) 



141 where C_e(y) denotes a constant which does not depend on 9. 

142 Differentiating with respect to 9, we obtain the set of equations to be satisfied at 

143 the maximum argmax Q(9, 9'): 



s 



a 



(3.3) 



s=l 



b S 
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In a — ip(a) = In 



s=l 



s 



\ 



s=l 



S 



(3.4) 
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149 
150 
151 
152 
153 
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155 
156 
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c 

d 



s 



( 



In c — ip(c) = In 



Y^e' {Ps I Ys) 



\ 



S 



\ 



S 



(3.5) 



(3.6) 



/ 



144 ijj(x) denotes the digamma function defined as the first logarithmic derivative of 

145 T(x). No analytical expression can be derived for 9 as the argument of the maximum 
we of Q(6, 6'), but a Newton- Raphson algorithm is efficient and easy to implement with 

147 a good empirical starting point as indicated in annex [Bl 



3.1.2. Expectation step by conditioning onto the number of clumps 

The right-hand side of equations 13 .31 to [331 involves (yU s | Ys), (hi(/i s ) | Y s ) , 
Eg/ (p s I Ys) and Eg/ (ln(p s ) | Y^) • To compute these expected values, we will proceed 
by conditioning onto the hidden number of clumps N. Proposition 13.11 shows that, 
given N, these four target quantities are simply marginal expectations of the sufficient 
quantity N s+ , the only necessary function of N that needs to be evaluated within 
each block s. 

In a second step, integration over the number of clumps is performed by recourse 
to importance sampling within a block s as detailed in proposition 13.31 Proofs of 
propositions are given in appendix [E] 



158 Proposition 3.1. Assuming Y ~ RLOL(8') with 9' = (a 1 ,b' ,c' ,d') , S strata and I s 

159 records in stratum s as in \2.3\ . then the complete conditional distributions of p s and 
wo p s in one particular stratum s are given by 



/is|N,Y, 6' ~ T(a + N s+ , b' + D 



s+J, 



(3.7) 
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161 and 



p 8 \N, Y, & ~ r(a' + iV s+ , 6' + F s+ ), (3.8) 

162 where in stratum s, N s+ = YliLi N S i denotes the total number of clumps caught, 

163 Y s+ = Xl[=i ^ s ^ e entire quantity harvested and D s+ = Yli.Li D S i is the whole 

164 catching effort. 

165 The quantities involved in the E step are given by 



Eg, (p s 


1 ?/*) 


a' + E e , (Ns+lys) 




(3.9) 


b' + D s+ 




Eg, (ln{p s ) 


1 J/*) 


=E e , {%l>{a! + N s+ ) \ys) 


-\n(b' + D s+ ), 


(3.10) 


E<?' (p s 


1 S/a) 


c' + Eg, (iv s+ y 




(3.11) 


d' + Y s+ ' 




Eg, (ln(p s ) 


1 Vs) 


=E„, (>((/ + N s+ ) \y,) 


-\n(d' + Y s+ ). 


(3.12) 



we This result merely comes from the conjugacy property between gamma and Pois- 

167 son distributions for p (gamma and exponential distribution concerning p). The 

168 moments of gamma and log gamma, beta and log beta distributions are recalled in 

169 appendix O 

no In order to go one step further into the calculus, we have to perform the integra- 

ni tion over N + . Proposition 13.21 gives the distribution of N + \Y + ,9 up to a constant. 

172 Subsequently, the integration over N + will make recourse to importance sampling as 

173 proposed in [15| . This Monte Carlo algorithm is detailed in proposition 13.31 

Proposition 3.2. Assuming Y ~ RLOL(a,b,c,d) with S strata, and I s records in 
stratum s, the conditional distribution of N s \9,y s is given (up to a constant K) by 

{ftd^wty^ . p U* W i W ,+i)J . n *w 

x i=l,yi>0 x «=l,j/i=0 

(3.13) 

174 To draw a sample according to the rather intricate looking distribution I3.13[ an 

175 importance sampling based algorithm is detailed in the following proposition for one 

176 replicate (often termed particle). In order to obtain a G-sample, this procedure is 

177 repeated for each block G times. 
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178 Proposition 3.3 (Generate one particle in one particular stratum s according to dis- 

179 tribution l3.13l) . A particle g is a vector (ivj+, N^ 9 \ . . . , A^jf ) in a particular stratum 
wo s. Omitting s to make the reading easier, we may assume with no loss of generality 
i8i that the first I + terms are non zero and the I — 1 + followings are the zero ones. The 

algorithm to generate one particle g runs as follows: 



182 



1. Generate = wherever y i=0 for i = I — I + + 1, . . . , I. 

2. Generate the value of the random sum according to the importance distri- 
bution : 



1 \ N+ f y+ \ N+ T(a' + N + )T(c' + N A 



V + D+J \d' + Y+J ( I ]£ 1 T{frN+ + l))T(N + -I + + i) 



fis(N + ) cx 



185 As the one dimensional importance distribution fjg is a quickly decreasing func- 

186 tion of N + , its normalizing constant can be easily approximated and a bounded 

187 interval is used in practice as the support of N + . 

188 3. Generate each iV^ ) for i = 1, . . . , 1+ so that the vector (iV 1 (s) - 1, ... , N$ - 1) is 

189 distributed according to a multinomial distribution A4(N^ (yi/Y + , . . . , yj+ /Y + 

4- Associate to the vector (N+ , N[ 9 \ . . . , Nj 9 ^) generated at the previous step, the 
importance weight : 

i+ r(jv! s) £ + i) 



i= i iW + i) 



190 The proof of this proposition is straightforward from importance sampling theory 

191 (see for instance chapter 3 of [22|). 

192 

The weighted sample of N + may be used to approximate the expected conditional 
value defined in equations 13.91 to 13.121 For instance, quantity 13.101 is approximated 
by : 

Ev (ln(Ai.) I ys) « ( a 1 - " W x W + N s+)) ~ Mb' + D s+ ). 



H u 1 ^ {9) 



9=1 
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193 3.1.3. Empirical Variance Matrix 

194 This section is devoted to the evaluation of the empirical variance matrix, so 

195 as to provide confidence regions. Because of the EM principle, we assume that the 
we algorithm has converged to the maximum likelihood value 6. The empirical Fisher 

197 information matrix is then given by proposition 13.41 To explicitly compute this 

198 information matrix, we propose to numerically integrate over N thanks to importance 

199 sampling as performed for the point estimation step. Technical details are also given 

200 in appendix [Fj 

Proposition 3.4. Assuming Y ~ RLOL(a, b, c, d) with S strata, and I s records in 
stratum s as in \2.3\ Let us denote I e (6) the empirical information matrix defined by 



W) 



2 ln[Y|0] 

d6 4 06, 



(3.14) 



At the maximum likelihood estimator 6, the following equality holds 

i 



s 



i 

1 


V o 



■S- 

b 2 








y(c) 
i 



\ 



1 



J 



s 



(3.15) 



with 



and 



o 



V 



o 












\ 





o K.m<$) y 



( Yar v MO) 

Cov„ a {a*,ip{a k s ) 



Yar„ a (a*) Cov Vs (a*,ip(c*) Cov Va {a*,c*) 

bf b* b* d* 

Cov Ua (a*,ip{c*)) Yar u (lp(c*)) <C<«v s (c*,^(c*)) 

Cov Va (a*,c*) Cov Va (c*,ip(c*) Var Vg lc*) 



where a* = a + N s+ , b* = b + D s+ , c* = c + N s+ , d* = d + Y s+ and u s stands for 
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204 the probability measure of N s+ \6, Y. 



205 As for the first derivative phase of the EM algorithm detailed in section 13.31 the 

206 operations E N ^ Y and Var N |g Y , needed to evaluate A s and B s , can be easily imple- 

207 mented by recourse to the very same Monte-Carlo N + sample that was previously 

208 drawn by importance sampling. 

209 3.1.4- Prediction of the random effects 

210 It is of interest to predict the random effects in each stratum, for instance to 
2n help illustrate the heterogeneity between units. In a linear mixed model context, 

212 the Best Linear Unbiased Estimator is defined by the conditional expectation of the 

213 random effect according to the data y and the point estimation. We follow the same 

214 avenue of thought and define a predictor of the random effects by the conditional 

215 expectation. Using formula [3791 and [3.111 the random effect predictors are given by : 

a a + E(N a+ \y 8 ,§) 

^ ed) = Hns\y,e) = v ~ y , (3.16) 

b + D s+ 

216 and 

, . c + E(N s+ \y s ,9) 

p^ ed) = HPs\y,e) = > - ~ ; , (3.17) 

d + Y s+ 

217 The following section aims at highlighting the differences between the continuous 

218 case detailed previously and the discrete one. 

219 3.2. MCEM algorithm for RDLOL model 

220 3.2.1. Straightforward transposition to the discrete case 

221 The definition of the model designed for the discrete case and called RDLOL 

222 model is given by equation 12.41 in this case the pivotal quantity Q(9, 9') reads : 



,9') =C-g(Y) + (a-l)^TEe, (ln/z s \ Y) + Sahib - bJ^Eg, Y. ) - Shx(T(a)) 

s=l s=l 

s s 

+ in ( r ( ( c C )r(dj ) + (c ~ l) S Ee ' ( 1r p° \Y<d + (d-i)J2 Es ' H 1 - p>) I ^) ( 3 - 18 ) 



The equations satisfied at the maximum for (a, b) are again I3~3l and [3T41 Due to 
the substitution of a gamma pdf into a beta pdf for the random effects governing 
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the geometric discrete marks in the random sum of counts, parameters c and d verify 
equations 13.191 and 13.201 (equivalent to equations 13.51 and 13.61 in the continuous data 
model) : 



^(c + d) - V(c) = - 



£]E„(liip, |Yj) 

s=l 

s 

Ps 



8=1 



In 



Ps 



Y 



S 



(3.19) 



(3.20) 



224 The approach used for the continuous case is reproduced to obtain, in each stra- 

225 turn s the conjugate conditional density of fi, p , so that the analog to propositions 

226 13.11 and 13.21 is : 

Proposition 3.5. Assuming Y ~ RDLOL{9') with 9' = (a',b',c',d'), S strata and 
I s records in stratum s as in \2.J\ , then the complete conditional distributions of /j, s 
and p s in one particular stratum s are given by 



and 



fis\N s+ , & ~ r(a' + N s+ , b' + D s+ ), 
Ps\N s+ , 6' ~ p( c ' + N s+ , d! + Y s+ - N s 



(3.21) 
(3.22) 



Furthermore the conditional distribution function of N s is 



Ns]9',Y] oc 



n 



Y sl - 1 
N, t ~ 1 



D 



\ 



V 



n 6 ( n ^ 
j=i-i++i 



r(a' + N S+ )T(N S+ + c')T{Y+ - N s+ + d') 



{V + D S+ ) N *+ 



(3.23) 



228 The choice of an efficient importance sampling distribution in the discrete case is 

229 not the straightforward adaptation of the continuous gives and a mixture has to be 

230 used to obtain an efficient and well behaved algorithm, detailed in appendix [HI 

231 3.2.2. The covariance matrix in the discrete case 

232 The covariance matrix in the discrete case benefits from the same conditional 

233 independence decompositions and the adaptation of the continuous case is straight- 

234 forward given the moments of the beta distribution in appendix [Cj the result is 
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235 detailed in appendix [G] The weighted sample of N + is used to compute the expec- 

236 tations and variance-covariance terms in the matrix components. 



237 3.2.3. Prediction of the random effects 

238 The predictions of the random effects are just given by the conditional expecta- 

239 tions. Unsurprisingly, the predictions in the discrete case and in the continuous one 

240 look very similar. fjf red) is still given by formula 13.161 and 

„ c + E(N a+ \y„9) 

p<r ed) = E(p.|y, 9) = - V ~ J , (3.24) 

c + d + Y s+ 



241 4. Applications 

242 In this section, we apply the EM estimation procedure to two real datasets of eco- 

243 logical interest. We then study the validity of asymptotic assumptions by assessing 

244 the coverage level of confidence regions. 

245 Jf..l. Real dataset - Gulf of St. Lawrence survey 

246 A multi-species bottom-trawl survey of the southern Gulf of St. Lawrence (NW 

247 Atlantic) has been conducted each September since 1971. The purpose of this survey 

248 is to estimate the abundance and characterize the geographic distribution of marine 

249 biota. The survey follows a stratified random design, with 38 strata defined largely 

250 as homogeneous habitats using depth, temperature and sediments properties. The 

251 target fishing procedure at each fishing station is a 30-min straight-line tow at a speed 

252 of 3.5 knots (i.e., 3.21km trawled distance). However the actual distance trawled can 

253 vary due to winds, currents and the avoidance of damaging rough bottoms; sampling 

254 effort is therefore variable among trawl tows, but this source of additional variability 

255 is easily accommodated in the models presented here ( the D s k in eq 12.31) . For our 

256 case study, we use data on the abundance of sea urchins and Sunflower starfishes 

257 collected during three survey years (1999-2001), in a total of 540 bottom-trawl sets. 

258 The time period was chosen to minimize inter- annual changes in abundance while 

259 ensuring a sufficient sample size. The species were selected because inter-annual 

260 changes in their geographic distribution resulting from movements of individuals at 

261 the scale of survey sampling can be assumed to be approximately nil. 

262 The histograms of urchin and starfish catches in kg per survey tow clearly reflect 

263 zero-inflated distributions (Fig [3] and H|). A large number of tows capture no urchin 

264 (nor starfish) and catches in non-zero tows tend to follow a skewed distribution. At 

265 the scale of the survey, sea urchins are distributed in patches of localized variable 
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abundance, interspersed by numerous and relatively large areas where the species is 
absent (FigEj). Such patchy distributions of organisms are prevalent in ecological 
science. Data in two strata are always zero, thus rendering estimation impossible if 
we were to fit one model per stratum or to consider p s as fixed effects. Because the 
hierarchical framework allows some transfer of information between strata, the other 
data help to predict p in these two strata. 



Urchins Biomass empirical distribution 



Sunflower starfishes Biomass empirical distribution 




2 3 4 5 
Biomass (log scale) 



kflfll 



~i r 
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Figure 3: Histogram of urchin biomass 
(kg/tow) from individual tows in the south- 
ern Gulf of St. Lawrence, bottom-trawl sur- 
veys: 1999-2000-2001 



Figure 4: Histogram of Sunflower Starfishes 
biomass (kg/tow) from individual tows in the 
southern Gulf of St. Lawrence, bottom-trawl 
surveys: 1999-2000-2001 



272 Jf.A.1. Maximum likelihood point estimation 

The estimation procedure follows the EM algorithm detailed in appendix ID1 (with 
a stopping rule when the sixth decimal does not change between iterations) and gives 
values of 

§ Urch = (a, 6, c, d) = (0.997797, 1.05107, 5.05733, 13.0312), 

and 

6 Sun = (a, b, c, d) = (1.91879, 1.80704, 1.90002, 0.898734), 

273 as a maximum likelihood point estimates respectively for Urchin and Sunflower 

274 starfishes datasets. 

275 A visual diagnosis of the goodness of fit is very informative. According to the 

276 RLOL model, data are drawn from a mixture and we cannot add directly a density 
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Figure 5: Locations of urchin catches (symbols) and stratum boundaries (lines) in the southern Gulf 
of St. Lawrence bottom-trawl surveys 1999-2000-2001. The radii of the circles are proportional to 
the biomass (in kg/tow) caught. The "*" denote sites with no urchins caught. Starfishes are not 
plotted. 



277 line on the histograms of figures [3] and H] since the zero ordinate of these figures is 

278 somewhat artificial : it depends on the width of the histogram bins and has been cho- 

279 sen so that the overall cumulative greyed surface is 100%. The expected histograms 

280 presented in figures [6] and [7] have been obtained using 1000 replications of the model 

281 with the same design at 9, and averaging the 1000 generated histograms. Obviously 

282 the obtained model histogram (averaging all the random effects) is smoother than 

283 the empirical distribution. The observed number of zeros falls below the expected 

284 number but within the 90% confidence interval for each species (as indicated by the 

285 vertical line on figures [6] and [7J and the overall shape of the distribution fits quite 

286 well the data in both cases. 
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Urchins : Average Histogram obtained on 2000 simulations 



Average Histogram obtained on 2000 simulations 
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Figure 6: Comparisons between urchins 
dataset and averaged histogram (1000 sim- 
ulations of datasets at Q Urch ) 



□ Real 
■ Expected 



I I 
4 6 

Biomass (log scale) 



10 



Figure 7: Comparisons between Sunflower 
Starfishes dataset and averaged histogram 
(1000 simulated datasets at Sun ) 



287 Jf.,1.2. Confidence intervals 

Relying on proposition 13.41 the asymptotic covariance matrices are evaluated at 
those maximum likelihood arguments : 



( 


var(a Urch 


X) > 




var(b Urch 


X) 




var(c Urch 


X) 


\ 


var(d Urch 


X) J 



Corr{6 Urch ,X 



( 0.0587 \ 
0.1020 
1.6804 
\ 14.4793 / 

/ 1 0.825 

0.825 1 

0.035 0.036 

\ 0.058 0.081 



0.035 
0.036 
1 

0.936 
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0.081 
0.936 
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and 
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/ var(a bun , Y) \ 

var(b Sun ,Y) 
var(c Sun ,Y_) 

V var(d Sun ,Y) J 



Corr{6 Sun ,X 



( 0.2555 \ 
0.3003 
0.2609 
\ 0.0894 / 

/ 1 

0.902 

-0.055 
\ -0.046 



0.902 
1 

-0.056 
-0.023 



-0.055 
-0.056 
1 

0.906 



-0.046 \ 
-0.023 
0.906 

1 / 



Essentially only a and b (resp c and d) are correlated. 

To evaluate the actual coverage of confidence regions in the present sampling 
conditions (that may be far from asymptotics), 16000 simulations were launched, 
assuming the same number of strata and the same number of data points per stratum 
as the urchin catches (resp. sunflower starfishes) with 9 as hypothetic true parameter, 
thus disregarding possible bias. As a practical working conclusions, Figures [8] and 
[9] show how to correct theoretical asymptotical confidence intervals. The results are 
quite different from one dataset to the other. 

• On Urchins dataset, to get an actual 90% confidence region, we must expand as 
far as the asymptotic ellipse corresponding to a 99.964% normal approximation 
as shown in Figure [8j 

• On Sunflower Starfish dataset, things work better and the 94% asymptotical 
confidence interval is quite a good surrogate for an actual 90% confidence re- 
gion! 

To understand Table [U we suggest to consider the median column as the reference 
confidence interval (based on simulation/ EM re-estimation). The right column gives 
bootstrap-l- EM re-estimation. We notice that the Bootstrap approach is completely 
unappropriate for our model. The estimation is clearly biased with a shift to the 
right (verified on simulations not shown here) although we tried to correct bias as 
proposed in [l^]. The width of confidence intervals are underestimated for both 
species and does not even contain the #-value. The hierarchical structure of the 
model may explain part of this bad behavior of bootstrap method but this would 
need further investigations not in the scope of this paper. The left column of Tabled] 
exhibits two different behaviors according to the species considered. 

• The asymptotic variance of maximum likelihood parameters under-estimate 
strongly the true sampling characteristics in the Urchin case. This may be due 
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90% Confidence Intervals 
(asymptotic) (via simulation) (via booststrap) 


Urchins case 


0.587 < a < 1.384 
0.496 <b< 1.547 
2.827 < c < 7.092 
6.387 < d < 18.905 


0.335 < a < 1.637 
0.163 <b< 1.880 
1.476 < c < 8.443 
2.419 < d < 22.872 


0.72 < a < 1.00 
0.61 <b< 1.03 
1.23 < c < 4.37 
1.68 < d < 10.89 


Starfishes case 


1.087 < a < 2.750 
0.905 <b< 2.708 
1.059 < c < 2.740 
0.406 < d < 1.390 


1.294 < a < 2.951 
1.198 < b < 3.141 
1.344 < c < 3.182 
0.558 < d < 1.559 


1.217 < a < 1.859 
0.938 < b < 1.663 
1.147 < c< 2.035 
0.347 < d < 0.858 



Tabic 1: Comparison of the asymptotic 90% confidence interval with the one obtained by simulation 
for each parameter component for both species 

3w to the large numbers of zero's for that species: consequently relatively less non 

315 zero data remain for the p's (inverse of patch abundance) and the estimation 

316 of c and d that rule the between units variation of p's may become difficult. 

317 • The Sunflower Starfishes case exhibits much better properties regarding the 

318 approximation of the covariance matrix. For this species, less zeros data occur 

319 and we guess that enough information is made available in the sample to get 

320 correct estimations. 

321 Figures [TUI and [XT] present the predictions for the random effects in each stratum. 

322 

323 4-1.3. Validation of the gamma assumption for random effects 

324 We have assumed that the random effects \x and p were distributed according 

325 to gamma distributions. This choice was essentially made for technical convenience 

326 because conjugate properties make the estimation easier. The validity of this as- 

327 sumption can be checked by considering random effects as fixed and estimate them 

328 independently in each stratum. Figures [T2] and [T3] present a pp-plot of empirical 

329 versus estimated probability distributions for p and p. 

330 The pp-plot for \i suggests that the gamma distribution is appropriate (Fig 1X21) ; 

331 this is not true of the gamma pp-plot for p (Fig [T3l) . First there are only 36 points 

332 estimates because 2 strata are empty and p's for these strata are not defined. Second 

333 the probability plot does not adjust to a straight 45 degrees line. Looking more 
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Urchins : Confidence intervals 



Starfishes : Confidence intervals 





1 vetsus3 1 vetsus4 

OS 

2 versus 3 2 versus 4 

OS 



Figure 8: The ligthest ellipse corresponds to 
90% confidence ellipsoid and the darkest one 
is 99.96% and contains 90% of the simulated 
values. 



Figure 9: The ligthest ellipse corresponds to 
90% confidence ellipsoid and the darkest one 
is 94% and contains 90% of the simulated 
values. 



334 closely at four extreme points in the p pp-plot, we found that they come from strata 

335 with less than two non-zero data points. Excluding these 4 points produces the much 

336 more acceptable fit of Figure [TU 

337 Simulations Studies 

338 The previous section showed different behaviors depending on the species : the 

339 EM procedure provides rather reliable estimates for the starfish RLOL statistical 

340 features but not for the Urchin ones. The purpose of this section is to check the role 

341 of the sampling designs. Simulation studies are performed to explore the quality of 

342 the EM estimation procedure and to check the actual coverage of the asymptotic 

343 variance-covariance matrix approximation. 

344 Jf.,2.1. Simulation design 

345 For a given set of parameters 9 = (a, b, c, d), we draw 1000 samples according to 

346 RLOL model given in eq 12.31 with a number S of strata and M measured points per 

347 each stratum. S has been chosen varying as k 2 with k = 3,4,5,6,8,10,12,15 and 

348 M = 5,10,15,20,25,30,40. 

349 For each simulation, the estimation procedure depicted in section [3] yields one 

350 point estimate and one estimation of the asymptotic covariance matrix. Assuming 
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Figure 10: Predictions of the random effects 
[i s in each stratum correspond to the ex- 
pected number of clumps collected during a 
measurement with standardcatching effort. 



Figure 11: Predictions of the inverse of p s 
in each stratum. These quantities give the 
expected biomass to be collected within a 
clump. 



PPplot for mu 
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Figure 12: pp-Plot with estimates of [i s versus a fitted gamma distribution. 



that the asymptotic approximation holds and using a normal approximation, con- 
fidence intervals can be given for the true value. As we work within a simulation 
context, the true value is known and one can compute the actual proportion of 
samples for which the asymptotic confidence interval covers the true value. 
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PPplot for rho 
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Figure 13: pp-Plot with estimates of p s ver- 
sus a fitted gamma distribution. The ex- 
tremal points correspond to strat with at 
least 75% of zeros 



PPplot for rho 
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Figure 14: pp-Plot with estimates of p s 
against a fitted gamma distribution after ex- 
cluding the four outliers. 



355 4.2.2. RLOL Results 

356 The simulation study is achieved for two values of parameters 9 corresponding to 

357 the two applications developped in section I4TT1 We choose Q Urchm = (1 7 1 ) 5 ; 13) and 

358 Q Sunstars = (1.9,1.8,1.9,0.9) as true parameter references for the simulations. We 

359 first present a study of the bias and then an investigation of the actual coverage of 

360 confidence intervals. 

361 Bias study 

362 We can study the bias by simulation according to the numbers of strata and the 

363 number of measure points within strata. Figures [TS] and [TBI present the results for 

364 relative bias obtained with 1000 simulations in each configuration. As expected 

365 it decreases quickly with the number of strata and only marginal amelioration is 

366 obtained as soon as the number of data per stratum becomes reasonable. 

36? Confidence intervals study 

368 Using 1000 simulations in each cell, the empirical proportion of the asymptotic 90% 

369 confidence ellipsoids that cover the true value is given in Figures [T7J and [TSJ With 

370 1000 trials in a binomial distribution with probability p of success, a confidence 

371 interval for p = 0.90 is approximatively [88%, 92%] : cells from Figures [17] and [18] 

372 that belongs to that interval have been colored in light grey. Results about confidence 

373 intervals strongly depend on the value of 9. The asymptotic approximation seems 

374 quite satisfying for $ Sunstars • the asymptotical conditions are quickly fulfilled and the 

375 design of the case study seems acceptable. For Q Urchin however, the present design 

376 should be strongly re-enforced (up to 40 points per stratum with 36 strata!) before 

377 yielding acceptable estimations, and confidence regions based on asymptotical theory 

378 are definitely too optimistic. 
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Figure 15: Urchins : Average relative bias in log scale depending on the number of strata and the 
number of measure points. 
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Figure 16: Starfish : Average relative bias in log scale depending on the number of strata and the 
number os measure points. 

379 These two sets of parameter recover two very different situations : the larger 

380 number of zeros in the Urchin case may render the estimation procedure more difficult 

381 than in the Starfish situation. However one should note that the difference is not 
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markedly pronounced : 34% instead of 24%! Such a simulation study shows that 
the quality of variance covariance matrix estimation used to build an ellipsoid of 
confidence behaves has to be checked through this simulation approach by instance 
to verify whether the asymptotic conditions are fulfilled and that the analyst should 
beware of overconfidence. 
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Figure 17: Urchin-like case. Effective pro- 
portion of 90% confidence intervals that 
cover the true value. Shading in particu- 
lar cells reflects the degree of overlap: M-S 
combination that produces confidence inter- 
vals that are too liberal are in black whereas 
the lightest grey shade reflects confidence in- 
tervals that properly characterize parameter 
uncertainty 



Figure 18: Sunstar-like case. Effective pro- 
portion of 90% confidence intervals that 
cover the true value. Shading in particu- 
lar cells reflects the degree of overlap: M-S 
combination that produces confidence inter- 
vals that are too liberal are in black whereas 
the lightest grey shade reflects confidence in- 
tervals that properly characterize parameter 
uncertainty. 



387 5. Conclusion and Perspectives 

388 The following conclusions have been reached: 

389 1. Compound Poisson distributions can conveniently represent the presence of a 

390 large number of zeros and a skewed distribution of non-zero values. To deal 

391 the occurrence of zero-inflated data, very parsimonious models can be designed 

392 (with two parameters only) : a Poisson random sum of independent geometric 

393 random variables in the discrete case and with exponential random variables in 



25 



394 the continuous one. They offer an alternative to the traditionnal delta gamma 

395 models and behave coherently when changing the scale of the catch effort, 

396 thanks to the Poisson process underpinning the model. 

397 2. Compound Poisson distributions can be interpreted using a hierarchical frame- 

398 work. They describe the data collection involved in sampling individuals gath- 

399 ered in (latent) patches drawn from the homogeneous Poisson process with 

400 abundance tuned by the distributional parameter of the random components 

401 of the Poisson sum. The introduction of a random effect structure at the top 

402 of the hierarchy is straightforward and accommodates non homogeneity among 

403 strata that are themselves considered as homogeneous units. Such designs with 

404 random effects and data with extra zeros are commonly encountered in ecolog- 

405 ical analyzes, but gamma random effects are yet rarely advocated : variation 

406 between strata is typically modeled using a normal (or lognormal) distribu- 

407 tion because its sufficient statistics match the commonsense interpretation of 

408 mean and variance. However, gamma random effects allow for partial conjugate 

409 properties with the compound Poisson model for zero-inflated data. Beyond 

410 this theoretical convenience, the parameters of the gamma distribution are well 

411 estimated in the Starfish like simulation examples and they can describe the 

412 entire range of variability between units for the real case study. 

413 3. Independence between the latent features p and p has been a priori assumed 
4w for the random effects between units. This absence of prior correlation is quite 

415 a stringent hypothesis as we might expect p and p to covary (e.g, low non-zero 

416 realized abundance could stem from either a small p or a large p). Working 

417 with a gaussian copula for a joint bivariate distribution for the couple (/i, p) is a 

418 bad remedy, because we would have lost the conjugate properties and increased 

419 computational load. To keep partial conjugacy , a better idea is considering 

420 the natural extension of the gamma family, but such bivariate distributions 

421 are rather restrictive since they can only take into account positive correlation 

422 and need that the two marginals share the same shape parameter. However 

423 such a model would remain parsimonious with 4 parameters: one is gained 

424 to depict correlation and one is lost to depict the marginals' shape. The issue 

425 of correlation has been addressed in 0] who proved via simulation that the 

426 correlation between p and p has little bearing on the property we are ultimately 

427 trying to predict in practice, i.e. the realized biomass in a tow. Finally, the 

428 correlation indicates that the latent variables p and p are model concepts that 

429 should themselves not be overinterpreted; they don't actually characterize the 

430 true size and number of organism patches. 
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431 4. Stochastic EM inferential techniques (with importance sampling for the non 

432 explicit expectation steps) require a modest computational effort since the ran- 

433 dom effects are taken partially conjugate with the compound Poisson distribu- 

434 tions. Auxiliary importance distributions can be proposed by careful inspection 

435 the structure of the joint distribution of the latent variables and integrating out 

436 as much as can analytically be done. Much advantage is taken from conditional 

437 independence, especially when computing the Fisher information matrix by re- 

438 sampling with the simulated missing data that have been previously generated 

439 to evaluate the maximum likelihood estimate. However, the value of results 

440 given here depends on the errors involved with the use of maximum likelihood 

441 asymptotic formula on one hand and on the precision of Monte Carlo sam- 

442 pling algorithms on the other hand. Due to the multidimensional nature of the 

443 latent variables to be simulated , the variability between several trials of the 

444 importance sampling techniques when evaluating the information matrix (and 

445 its inverse) can be important enough, especially when few data makes a rather 

446 flat likelihood function. 

447 5. Asymptotic errors bounds need to be checked and corrected if necessary. We 

448 relied on a simulation study to get a more reliable idea of their ranges. The 

449 simulated sets of zero- inflated data show that, in the Starfish case, one can 

450 readily trust the confidence intervals based on the information matrix while in 

451 the Urchin case, one should beware of being overconfident. The asymptotic 

452 conditions may not be encountered rapidly. For the Starfish case study, the 

453 design allowed a reasonable estimation of the RLOL model features. For the 

454 other species with a 10% higher probability of getting zero values, satisfying 

455 precision estimates with 40 strata need at least collecting 40 data points per 

456 stratum before the confidence coverage gets reasonably close to its theoretically 

457 recommended approximate value. Because 1600 stations represents generally 

458 unrealistically large sampling effort for a marine bottom-trawl survey in that 

459 Urchin example, statisticians need to inform practitioners (before launching 

460 the data collection) about possible underestimation of uncertainty. 

461 6. Covariates for the fixed effect of environmental variable (depth, temperature 

462 and habitat type) could be added to the model, potentially enhancing ecological 

463 interpretation of the observed patterns in organism abundance and distribu- 

464 tion. However, it may bring a lot of additional burden during the inferential 

465 computations since many of the conjugate properties would be lost. For the 

466 same reasons, non exchangeable strata (with for instance an intrinsic CAR 

467 structure on the top of the hierarchy as described in (author?) jil) have not 
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been considered here. Simple (low dimensional) importance sampling should 



be replaced with brute force Hastings Metropolis techniques 11]. In such a 



context, it may be worthwhile to work on encoding prior knowledge 14J into 



)robability distributions and switch the problem into a Bayesian framework 



relying on ready-made tools such as WinBugs for inference [24 



In the case study, the random effect models with compound Poisson distribu- 
tion for the occurrence of zero-inflated data fit the data well and allow transfer 
of information between strata to help predict in data-poor units. Its hierarchi- 
cal structure favors discussion between ecologists and statisticians, and helps 
query its interpretation in term of ecological situations with extra zeros. 
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549 A. Compound Poisson process characteristic function 

When X is real valued, we denote by / the Fourrier transforrrfl of / (i.e the 
characteristic function of X) : 

/H = E{e^ x ) 

From equation 12.11 the compound Poisson distribution g is such that : 

00 n 

fa) = E e " M ^r (/>))" = e_M1_/M) ^ 

n=0 

550 This equation exhibits the infinite divisibility property of Y with regards to pa- 

551 rameter /i, which offers a nice conceptual interpretation when returning to the marked 

552 Poisson process underneath this stochastic construction : the resulting quantity Y is 

553 obtained by collecting a random number of primarily (hidden) batches Xi distributed 

554 at random with intensity \i. Such a conceptual latent process of aggregates would be 

555 intuitive for many ecologists. Conversely, one can easily check by writing the loga- 

556 rithm of their characteristic functions, that traditional models for zero-inflated data 

557 (think for instance of the delta-gamma model or the Zero-Inflated Poisson model such 

558 as j2l[) lack of coherence for adapting to a change of the scale in the experiment. 

Among the many choices for the probability distribution / of the random mark 
of the sum, this paper focuses, for parsimony and realism, on the exponential distri- 
bution for X (continuous case) that is : 

f{x) = pe~ px 

559 so that f(u) = and g(u) = e~^p+'^ . For the discrete case, we suggest the 

560 corresponding geometric distribution : f(x) = l x> o x (1 — r) x r x e lwx leading to 

561 f(u)) = j^-is an d g(w) = e V 1 -' reiw yf or the exponential compound Poisson count 

562 model. 



1 For non negative integer valued random variables X the probability generating function P(z) = 

00 

Pr(X = n)z n is the corresponding machinery for handling discrete distributions : the same 



results can be found in this case by setting the change of variables z = e %u 



31 



563 B. Initialization of the Newton-Raphson algorithm 



The main point on Newton-Raphson algorithm consists in choosing a good initial 
point. In this paper we use this algorithm to find the zero of 

ln(a) - ip{a) - C = 

Note that function ip verifies the following asymptotic series' expansion [l| : 



~ In (a;) 

x— >oo 



x— >oo 



ln(x) 



2x 
1 



n=l 
1 



2nx 



2n 



2x 12 x 1 



120 a; 4 



+ 



564 The convergence is very fast (see Figure [T9|) so that we choose to initiate Newton- 



Raphson algorithm with Xq 



2C 



Figure 19: Difference between log(x) — tp(x) and l/2x 



see C. Computation of the moments of gamma and log gamma, beta and log 
567 beta distribution implied in the expection step 

see C.l. First and second moments for the sufficient statistics of the gamma pdf 

Let Z be a random variable with gamma distribution, Z ~ T(s,t). Using laplace 
transform it is easy to obtain the first moment of ln(Z) : 



E (e xln ^) = E (Z x ) 



r s) 



T{s)t x 
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Differentiating this equation with respect to A, we have the expected value of ln(Z) 
(when A = 0) and Z\n(Z) (when A = 1): 



dE (Z x ) 



E(ln(Z))=^( S )-ln(t), 



A=0 



and 



dE (Z x ) 



A=l 



E (Z ln(Z)) = - {iP(s + 1) - ln(t)) 



569 Taking the second order derivative, we show 



(C.l) 



(C.2) 



d 2 E (Z x ) 



d\ 2 



E (\n(Z) 2 ) = *l)'{s) + tfj(s) 2 -21n(t)^(s) + \n(tf 



A=0 



Therefore the variance-covariance matrix between Z and ln(Z) is : 

s 1 

F 7 



(C.3) 



570 C.2. First and second moments for the sufficient statistics of the beta pdf 
Let S be a random variable with beta distribution S ~ fl(s,t). 



E (e A MS)) 



T(s + t) T(s + \) 
T(s + t + \) T(s) 



So that, by first and second differentiation, one gets, (the derivation is quite straight- 
fully performed if working with InE (e Aln ( 5 ))) : 

E(ln(5)) =ip(s) -ip(s + t) , E(ln(l - S)) = tp(t) - tp(s + t) 
E (\n(S) 2 ) = ip'(s) - ij'(s + t) + - il){s + t)f 

One can extend the properties of characteristic function by considering the func- 
tion of the two arguments A and \x 



E (gAln(S)+Mln(l-S)A 



T(s + t) T(s + X)T{t + fj.) 
F(s + t + \) T(s) T(t) 
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571 By cross-differentiation under regularity conditions (working with InE (5^(1 — S)^ 

572 makes things easier here also) , the joint moment can be analytically obtained : 



d 2 E(S x {l -sy) 



dXdfx 



A=0,/i=0 



E(ln(£) ln(l -S)) 

-i/j'(s + t) + E (ln(5)) E (ln(l - S)) 



Therefore the variance-covariance matrix between ln(S') and ln(l — S) reads 

' ip'(s) -ip'(s + t) -tfj'(s + t) 
-ip'(s + t) rj)'(t) -ip'(s + t) 



573 D. EM algorithm principle 

From a constructive point of view, one often writes 

[x,z\e] = [x\e,z] x [ z \e], 

but using Bayes rule, we may write the reverse logarithmic form : 

In [x \6] = \n[x, z \6] — In [z \9,x] 

574 Let us remark that relation ID. II is valid whatever z represents. 



(D.l) 



575 D.l. Recall about EM algorithm and control of the gradient 

Under regularity conditions for the joint distribution [x, z \6] and the conditional 
one [z \ 8, x] , integrating relation [DJ] with respect to the probability density [z \ 9', x] : 



Q(9,6')-H(9,9') 



(D.2) 



The maximum of 9 i— > H(9, 9') is achieved in 9 = 9' 27 
So H(9,9') < H(6', 6'). Let us consider E2 for 9 and & 



In \x\9\- In [x \9'] = (Q(9, 9') - Q(9', 9')) + (H(9\ 9') - H(6, 9')) 
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EM algorithm is based upon an iterative procedure which exhibits 9 such that 
Q(6, 9') > Q(9', 9') . The best 9 is obtained by 

9 = argmax Q(9, 9') 
e 

During iteration we can monitor the value of the gradient for the log likelihood : 
dln[x|#] 91n[x, z\9] d\n[z\9,x] 



d9 39 39 



(D.3) 



Integrating the right hand term with respect to conditional density [z \9,x] ,and 
keeping in mind that, for any sufficiently regular pdf f(z; 9) of variable z with pa- 
rameter 9 one can write: J dln Q < f' e " > f(z; 9)dz = ~§e J 81n g^' 6> ' > f(z; 9)dz = 0, we have 

z z 

d\n\x\9\ [d\n\x,z\9] r , , [din \z \9, x] r ,„ , , 

'z \9, x\dz — / — [z \9, x\ dz 



89 J 89 J 89 

z 

dln[x\9} fd\ii[x,z\9 



89 / 89 



[z\9,x]dz (DA) 



577 We may use this equality (computed by Monte Carlo method) to perform a gra- 

578 dient method to obtain the maximum likelihood or just to check along the iterations 

579 that the gradient is going to zero. 

580 D.2. Score function 

From now on, let's call Sc(9,z,x) = 9ln ^^ 9 h he score, i.e the complete loglikeli- 
hood gradient and Sc(9i,z,x) = 91n ^ z \ 9 ^ its % th component. V9, equation ID. 41 proves 
that its conditional expectation (with respect to [z \9,x}) is always equal to the like- 
lihood gradient. Pushing the derivation game one step further leads to: 

8 fd\n[x\9}\ [ fdSa. d[z\e,x][z\6 7 x]\, 

[z \9, x] + Sc t — — } dz 



89 j \ 89i J J { 89j 1 1 ' 1 89 j [z\9,x 

z 

8 2 \n[x\9] f [ 8 2 In [x, z \9] f 8hi[x\9 



89 i 89 3 J \ 89 i 89 j 1 \ J 89 3 



s °i Sc j \* \°i x \ dz ( D - 5 ) 
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581 D.3. Information matrix 

To obtain the covariance matrix of the estimators at the maximum of likelihood, 
the empirical information matrix needs to be computed. The second order derivative 
is obtained by differentiating ID. 11 



d 2 \n[x\9] d 2 \n[x,z\9] d 2 '\n[z \9 ', x] 



ddiddj 89 i d9 j dOidOj 



(D.6) 



At the maximum 9 = 9, formula ID.3I implies 



dln[x\9] 



so that equation ID. 51 



takes a more friendly aspect because the score term 9l ^ 9 ^ 'm the right hand side 

vanishes at 9 = 6 . Equation ID.6I becomes therefore much more handy because it 
only involves conditional expectations of first and second derivatives of the complete 
likelihood terms : 



d 2 ln( 



x 



8 



ddiddj 



d 2 In 



X, z 


e 


d In 


x, z 


9 


d In 




9 






- + 













de 



89, 



9,x 



dz (D.7) 



As J 



8 In x, z\ 



9,x 



dz 



Sin 10 



0, the second term in the right hand 



583 side of eq ID. 71 can be considered as the conditional variance of the gradient of the 



584 complete log-likelihood In 



x,, 



This expectation can be numerically computed 



585 with the same techniques to which recourse was made for the EM algorithm. 



586 E. Detailed proofs of propositions 

58? E. 1 . Proof of proposition \3.1\ 

Since we detail the computation for one particular s, we will omit to mention it in 
order to make the reading easier. We also note respectively y, D_ and N_ the vectors 
of data, catching efforts and corresponding number of clumps in one stratum. 
We define J as 

J(*L>P,V) = [p,P,K\a,b,c,d,y,D] . (E.l) 
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Then J satisfies the following set of equations : 
« \y,P,V,K\a,b,c,d,D] 

] \ [ViWi,p] [Ni\n,Di] [p\a,b] [p\c,d] 



1 



cx ] I [ Vi \Ni,p,n] [Ni \p,p] {^e-^) (p^e^) 



vi=l 



with the convention that [A\B] oc f(A,B) means that the coefficient of proportion- 
ality only depends on B. We note /* the number of zero value y and we reorder the 
vector y so that the I + = I — I* non zero yi are the first, so that J may be written 
as : 



( n KNi*)e~^ J (p a - l e-» b ) (p c - x e-P d ) 

\i*=Z-/*+l / 

Defining Y + = J2i=i Vh N + = J2i=i N i and D + = J2i=i A, we obtain : 



J(N, P ,»)«\n r{Ni) »- Ni + 1) )e 



Vi I p -p(Y++d) N++C-1 -ti(D + +b) N++a-l 



p 1 e 1 'p 



Conditionally to the latent vector N_, the random effects p and p are independent. 
Isolating the terms which depend on p on one side and those depend on p on the 
other, we find that 

[fi\N, 6, y, D] ~ T(a + N + ,b + D+) 
[p\K,e,y] ~T(c + N + ,d + Y + ) 

For the expectation step we only need to compute E e (p s \ Y S Y E,g (la(p s ) | Y" s ) and 
the same sufficient statistics concerning p. 

Since p s \N s ,6,y follows a gamma distribution T(a + N s+ , b + D s+ ), the conditional 
expected value p s given N s and 9 = (a, b, c, d) is (a + N s+ ) /(b+ D s+ ). 
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Then 



If Z follows gamma distribution T(s,t), then E(ln(Z)) = ip(s) — ln(t) (see annex O), 
so that 

E e , (\n(p s ) | yj = (^(a' + iV s+ ) - ln(&' + D s+ ). 
We have respectively for p s 

and 

^ (ln(p s ) | = E„, (>(c' + iV s+ ) |yj - ln(d' + Y s+ ). 

588 E. 2. Proof of proposition \3.2\ 

589 Let us define J as the distribution of [p, p, N_\0' ,y, D_] in one particular stratum 

590 s. We will write J in a bottom-up perspective and consider the distribution of p and 

591 p conditionned by N, because p and p are conditionally independant. 

J is given by : 

J(N,p,p)= [p,p,N\6',y,D] 

= [p\N,9,y] [p\N,9',y,D] [N_\,9,y,D] 

Using the independent conditional gamma distributions of p and p and integrating 
according to p and p given N_, we can exhibit all the terms depending on TV. 



I I J{N,p,p)dpdp= [N\9,y,D] 
J p J ii 



^11 TVAnTVAT. + lJ 11 5{N * 



l Vr(jvi)r(jvi + i); v " ^ r( fl + jv+)r( c +jv. 



592 K 5". Proof of proposition 3.4 



In the following Z will stand for all the hidden variables i.e Z = (N, /x, p) , |Mjj | is 
another notation for matrix M that details the content of the i th row and j th column, 
and dF *P stands for the gradient of F written as a vector whose i th component is the 
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scalar 6 ^f' . The key equation involves rewriting equation ID .71 as the expectation 
of the second order derivative of the complete log-likelihood and the variance of the 



score (its gradient) to be taken with regards to the conditional distribution 
(see annex lD~3l) 



d 2 ln( 



x 



e 



dOidOj 



d 2 ln( 



x, Z 



9 



Yar 



Z\x 



dln[x,Z\6] 



x,e 



(E.2) 



Computing the first term of the right hand side of equation IE. 21 is easy, since 
[a;|;z, 0] = [x\z] (consequently the complete log-likelihood ln([a;, z \6]) can be sep- 
arated as In (fx \ z}) + \n([z,\9])) and the gamma random effects [z\9] belong to an 
exponential family. As a consequence, annex [F] shows that 



E 



d 2 ln( 



x, Z 



oe.de, 



d 2 ln( 



X, z 



89 i d9 j 



S 



/ 




1 

b 








\ 




i 
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a 
¥ 


















-V>'(c) 


1 

d 




\ 








1 

d 


—c 


J 



As shown in Figure [20j , given Y S ,Y S ' and 9, the latent variables Z s and Z s 
stratum s and s' are conditionnaly independent, therefore : 



of two 



Yar 



Z\x 



d\n[x,Z\9] 



^Yar 



Z s \x 



( ln (^y 

~Ps 

Hps) 

\ -Ps 



\ 



) 



To evaluate the variance of the score in stratum s, we will take advantage of successive 
conditioning due to the hierarchical structure depicted in Figure [2j Recalling that 
the latent variable Z s includes, in addition to (fi s ,p s ), the vector iV s , i-e the latent 
number of clumps for each record, the variance conditional decomposition formula 
gives: 



Yar 



Z a \x 



( Hps) \ 

-Ps 

Hps) 

-Ps 



( 



E 



Ns\x 



Yar 



( Hp 

-p. 



\ 



Hps) 

V ~Ps 



\ 



7 



■Yar 



N 3 \x 



( ( Hp 

E 

P 



\ 



^ s I AT 
HPs) ^ 



\ 



\ 
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Figure 20: The random effects in each stratum are conditionally independent given the data and 
the set of parameters 



So that we have 



W,x) = S 



1 

6 





1 
b 

— Jl 











\ 

o 

i 

c_ 



with 



A s — 



Yar 



\ 



"/is 



IN 



ln(p 



\ 



) 



and -B, = Yar 



N s \x 



( H»s) \ 



E 



V 



Hp*) '— 



\ 



J 



Given N s , fj, s and p s are independent. Moreover the pdf [p s \N s , Y s , a, b, c, d] and 
\Ps\N s , Y s , a, b, c, d] are gamma and analytic expressions are available for the expec- 
tation and variance of the gamma sufficient statistics, as detailed in equations lC.il to 
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IC.31 The key functions of N s+ are (a s f, b s f, c s f, d s f) = (a+N s+ , b+D s+ , c+N s+ , d+Y s+ ) 
such that : 

/ ln(/i s ) \ / HO-HK) \ 

-Us 



E 



N 



He's) ~ Hd's. 



\ 



7 



V ~Ps ) 

and then B s is obtained by taking the covariance of this vector : 



B, = Yar 



N s+ \e,x 



E 



V 



IN 

Hps) 



) 



Given N s additional advantage is taken from the conditional independence of p s and 
fx s as shown in Figure [2H • 




Figure 21: Given N, p s _L /i s 
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593 and the expression for A s follows easily. 
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594 F. Second derivative of the complete log-likelihood 

Let us first recall the complete log likelihood of the model : 

s s 
In [x, z \9] = C-e + (a — 1) In fi s + Sa In b — b fi s — Sdnl^a) 

s=l s=l 
S S 

+ (c — 1) In p s + Sc In d — d p s — S lnT(c) 

s=l s=l 

In the first derivative, the latent variables ^ and p appear not surprisingly only 
through their arithmetic or geometric means (sufficient statistics for the gamma pdf). 
Using standard notation p, for the arithmetic mean 4 Y2s=i we nave : 



9 In [x, z |0] 
da 



= £(Hu)+ln&-<0(a) 
91n[x, z |0] /a _ 



9 In [x, z \9] 

dc 

d In [x, ^ |0] 



5(ln(p)+lnd-^( C ; 



595 The gradient of the complete log-likelihood (so-called the " score" ) may be split into 

596 two parts : the first one A# does not depend on the latent variable z while the other 

597 one A z gathers terms depending on z (and possibly of 8), i.e : 



d In 



x, z 



09 



An + A, 



with 



/ ln6--0(a) \ 



An = S 



In d — ip(c) 



\ 



( ln(/i) \ 



A 7 = S 



J 



Hp) 
\ -p ) 

In addition here, A z does not contain terms with 9, consequently the second order 
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derivatives are easy to obtain and don't involve the latent variable 



d 2 In [x, z \9] 



dada 

d 2 In [x, z \9 

dadb 
d 2 In [x, z \9] 
dbdb 



Sip'(a) 



S 

= 1 
Sa 

1? 



d 2 ln[x,z\9] 

dcdc 
d 2 \n[x,z\9] 

dcdd 
d 2 hx[x,z\9] 

dddd 



-Sij'(c) 



S 
d 



-Sc 



598 G. Second derivative of the complete log-likelihood with discrete data 

The complete log likelihood of the model, in the discrete case, reads : 

s s 
In [x, z \9] = C-e + (a — 1) ^^ln/i s + Sa In b — b^^fi s — S'lnr(a) 

s=l s=l 

\ \ i \ i / s= i s= i 



p. 



In the first derivative, the latent variables fi and p appear only through their 
arithmetic or geometric means (sufficient statistics for the gamma and beta pdf). 
Using standard notation p, for the arithmetic mean 4 Y^ s =i A*s ; we have : 



d In [a;, z \9] 
da 



= S fln(/i) + In 6 — ip(a 
din [x, z \9] 



db 



S [ — — fJL 



d In [x, z \9] 
dc 

din [x, z \9] 
dd 



S ln(p)+^(c + d)-V(c) 



S ln(l -p) +ijj(c+d) -ip(d) 



599 The gradient of the complete log-likelihood (so-called the " score" ) may be split into 
eoo two parts : the first one A# does not depend on the latent variable z while the other 
sol one A z gathers terms depending on z (and possibly of 9), i.e : 



d In 



x, z 



d9 



Aa + A, 
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with 



A, 



5 



/ \nb-ip{a) \ 

a 
b 

ip(c + d) -tjj{c) 
V ^(c + d)-^(d) / 



S 



( Hp) \ 
-p 

ln(p) 



V ln(l-p) / 

In addition here, A 2 does not contain terms with 9, consequently the second order 
derivatives are easy to obtain and don't involve the latent variable; with Z standing 
for all the hidden variables i.e Z = (N, 



d 2 ln( 



x, z 



dOidOi 



S 



\ 



1 

b 






IP 











\ 






-tfj'(c) + tp'(c + d) f(c + d) 

ip'(c + d) -ip'(d) +ip'(c + d) J 



As shown in Figure [201 for the continuous case , given Y s , Y s i and 9, the latent variables 
Z s and Z s i of two strata s and s' are conditionnaly independent, therefore : 



Var : 



^ ' s=l 



( Hp*) \ 

~Ps 

hi(p s ) 
V ln(l-p s ) J 



To evaluate the variance of the score in stratum s, we will take advantage from 
successive conditioning due to the hierarchical structure depicted in Figure [2] still 
true for the discrete case. The variance conditional decomposition formula gives: 
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So that we have 



I £ (9,X)=S 



( ~il>'{a) 
l 
b 






l 

6 

_ JL 

¥ 









-ij'(c) + ij'(c + d) 

ij'(c + d) 








\ 



f{c + d) 
-^(d)+^(c+d) J 



s=l 
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with 



.4. 



E 



/ ln(/i s 



Var 



V 



IN 



-fJ>8 

ln(p s " 
Vln(l-p s ) / 



/ 



and £>, = Var 



/ ln(u s 



E 



V 



ln(p 8 ) 
\\n{l- Vs ) J J 



Given Ng, fi s and p s are independent. Moreover the pdf [p s \N s , Y s , a, b, c, d] and 
[p s \N s ,Y s ,a,b,c,d] are gamma and beta so that analytic expressions are available 
for the expectation and variance of the gamma sufficient statistics, as detailed in 
equations IC.f I to IC.3I The key functions of N s+ are (a s f, b s f, c s f, d s f) = (a + N s+ , b + 
D s+ , c + N s+ , d + Y s+ — N s+ ) such that : 



/ ln(/i, 



E 



\ 



IN 



MPs] 

\\n(l-p s ) J 



( V«) - HK) \ 

^(C s ) - + d' s ) 
\^d' s )-^c' s + d' s ) ) 



and then the matrix B s is obtained by taking the covariance of this vector. Given N s 
additional advantage is taken from the conditional independence of p s and fi s (as 
shown on Figure [2T1 for the continuous case). 



Var 



-Ps 



IN 



I 



\ 






^(d s )-f(c' s + d' s ) -iP'(c' s + d' s ) 

-f(c> s + d' s ) ^'(d' s )-f(c' s + d' s ) J 



a' 
6 /2 



and the expectation to obtain A s is performed via importance sampling. 
To sum it up 

d 2 \n[Y\9] 



W) 



(g.i; 



At the maximum likelihood estimator 6, the following equality occurs : 



W,Y) 



S 



( -ft (a) 
i 

6 






l 

6 

_ _a_ 

ft 2 






\ 



-V>'(c)+^(c + d) ^'(c+d) 

^'{c + d) -4>'{d) + f{c + d) J 



(G.2) 



s=l 
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with 



A, 



-i 

61 






bp 






and 






-E^K + d' s )) - n< + d' s ))J 



B, = Yar 



N s+ \e,x 



ij(c' s ) - ^(c' s + d> s ) 



603 where a' s = a + iV s +, = b + -D s+ , c' = c + iV s+ and d' s = d + Y s+ — N s+ ( b' s 

604 is the only term that is not a function of N s+ , thus behaving like a constant with 

605 regards to the Yar N ,§ x operator) 

606 H. The discrete algorithm 

607 If we adapt bluntly from the continuous version, the algoritm would write 



1. Generate = wherever y i=0 fori — I — I" 

2. Generate a value of N + according to 



1,...,/. 



N, oc 



T(a' + N+)T(d + N+)T(d' + Y + - N+)D 



+ 



(V + D + )"+»+]$Ur(Ni 



(you 



609 
610 



3. Generate each iVj for i = 1, . . . , J + , so that the vector N is distributed according 



to a multivariate hypegeometric Fisher distribution [17j given by 

g(N;N + ,Y,D/D + ) 



with 



[N\N, 



K 



9(N;N + ,Y,D)= fj (%.)(D j /D + ) 
i=i*+i \ 3 / 

yes 
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and 

{ i=i*+i 
4. Associate to the vector the weight 



w (g) 



k <3) n 



1 iV + (YD).\ 



i=I* 



Importance Sampling relying this time on the multivariate hypergeometric distribu- 
tion seems to stand naturally as the core of the algorithm to evaluate f)3.23p . But 
during our first trials, the above adaptation of the continuous version performed 
very badly, leading to a large variance of the importance weights, i.e. a degeneracy 
phenomenon that would put the all weight onto a very few contributing particles. In 
order to put more weight onto particles that have a good chance to efficiently attain 
the target distribution, a mixture was chosen as the importance distribution for a 
modified algorithm. The idea is similar in spirit to the auxiliary particle filtering of 



201 ] . More precisely, the first step consists of determining an approximate mean of 

acc 

T(a' + N+)T(d + N + )T(d' + y + - N + )D+ + 



N s+ in stratum s, denoted One draws a L-sample of N s+ according to 



g(N+) cx 



r(v + D + ) a '+ N +r{N + + i)r(jv+)r(y+ - n + + 1) 

The g distribution corresponds to the conditional distribution of N + given the sum 
of the data collected in stratum s b 
by the mean over a sample that is 



of the data collected in stratum s but ignoring the individual records. is given 



en and provides a good estimation of the location of N s+ . As previously we omit the 

612 index s to make the reading easier. Subsequently, the following algorithm relies on 

613 independent but non identically distributed simulations : 

6u 1. Generate = wherever yi = for i = I — I + + 1, . . . , I. 

2. Draw />) ~ T(a' + A^ e/) , V + D+) and ~ (3{c' + Jv£ e f) , d' + Y s+ - N^ f) ) 
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3. Given ^) anc i p ( 9 ) ; draw N { £ ~ [iV sfe |//^, y sk ] that is : 

rvW — ]A — k f » i9 ¥ 9) Di \ Ni 1 t 

[7V * - fcJ - Ki ^ i_ p (fl) J r(jvor(y 4 - jv, + i)r(jv, + 1) ^ 0<N ^ Y ^ 

where denotes the normalizing constant. 

4. Compute the weight of each particle g using 

i+ 



w 



(9)= n 



LA Ki(fj,Wp(s>)) N * V (b> + D a+ ) N »+ 
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